Compositional Features of Eukaryotic Genomes for Checking Predicted Genes

Authors

  • Stéphane Cruveiller
  • Kamel Jabbari
  • Oliver Clay
  • Giorgio Bernardi
Abstract

Gene prediction relies on the identification of characteristic features of coding sequences that distinguish them from non-coding DNA. The recent large-scale sequencing of entire genomes from higher eukaryotes, in conjunction with currently used gene prediction algorithms, has provided an abundance of putative genes that can now be analysed for their compositional properties. Strong, systematic differences still exist, in several species, between the compositional properties of sets of ex novo predicted genes and genes that have been experimentally detected and/or verified. This is particularly evident in the estimated gene set (>45,000 genes) of the recently sequenced rice genome, where roughly half the predicted genes are compositionally unusual and have no known orthologues in the dicot Arabidopsis. In a few cases such differences might suggest a bias in experimental gene-finding protocols, but the quasi-random nature of the compositionally aberrant predicted genes is a strong indication that many, if not most, of them are false positives. It therefore appears that some important features of coding regions have not yet been taken into account in existing gene prediction programs. Statistical base compositional properties of curated gene data sets from vertebrates, which we briefly review here, should therefore provide a useful benchmark for fine-tuning probabilistic gene models and model parameters that are currently in use.


Similar resources

GC-Profile: a web-based tool for visualizing and analyzing the variation of GC content in genomic sequences

In order to understand the evolution, structure and function of genomes, it is important to know the general compositional features of DNA sequences. Based on the quadratic divergence, a new segmentation algorithm to partition a given genome or DNA sequence into compositionally distinct domains has been put forward. With the aid of the technique of cumulative GC profile, the distribution of seg...


Compositional analysis of non-coding regions in eukaryotic genomes

Whereas, in recent years, many efforts have been devoted to locating genes within genomes, relatively few tools have been developed to identify the regulatory regions required for the correct transcriptional activity of the genome. This task is particularly difficult in the case of eukaryotic organisms, for which regulatory regions represent a small percentage, overwhelmed by presumably non-func...


Evolutionary History of the Human Genome

Compositional genomics is an approach to the problem of the organization of eukaryotic genomes. Initially this approach consisted of analysing the base composition of complex eukaryotic genomes (such as the nuclear genomes of vertebrates) by using density gradient centrifugation in the presence of sequence-specific ligands. This approach revealed that vertebrate genomes are mosaics of very long...


Analysis of GC-compositional Strand Bias in the Transcription Start Sites of Plant and Fungal Genes

A recent paper reported a GC-compositional strand bias, or GC-skew, defined as (C-G)/(C+G), where C and G denote the numbers of cytosine and guanine residues, respectively, near the transcription start sites (TSS) in Arabidopsis [4]. However, it is unclear whether other eukaryotic species have similar GC-skews, and the biological meaning of this bias remains unknown. In this study, we conducted compa...
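As a rough illustration of the GC-skew quantity quoted in this abstract, the following Python sketch computes (C-G)/(C+G) for a sequence and over sliding windows. The function names, the window and step sizes, and the choice to return 0.0 when a window contains no C or G are assumptions made for this example; they are not taken from the cited study.

```python
def gc_skew(seq: str) -> float:
    """Return (C - G) / (C + G) for a DNA sequence."""
    s = seq.upper()
    c = s.count("C")
    g = s.count("G")
    if c + g == 0:
        # Assumption: report 0.0 when the sequence has no C or G at all.
        return 0.0
    return (c - g) / (c + g)


def windowed_gc_skew(seq: str, window: int = 100, step: int = 50):
    """Hypothetical helper: GC-skew in sliding windows along a sequence.

    Returns a list of (window_start, skew) pairs. The window and step
    defaults are illustrative only.
    """
    last_start = max(len(seq) - window + 1, 1)
    return [
        (start, gc_skew(seq[start:start + window]))
        for start in range(0, last_start, step)
    ]


if __name__ == "__main__":
    # Toy sequence, not real genomic data.
    print(gc_skew("CCCG"))
    print(windowed_gc_skew("CCGGCCGG", window=4, step=4))
```

In practice one would align the windows to annotated transcription start sites, which this sketch does not attempt.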


Assessment of compositional heterogeneity within and between eukaryotic genomes.

Using large amounts of long genomic sequences, we studied the compositional patterns of eukaryotic genomes. We developed a simple measure, the compositional heterogeneity (or variability) index, to compare the differences in compositional heterogeneity between long genomic sequences. The index measures the average difference in GC content between two adjacent windows normalized by the standard ...



Journal:
  • Briefings in bioinformatics

Volume 4, Issue 1

Publication date: 2003